fix: don't panic and trigger recovery when applying cancel command for created job #19291

Merged
merged 1 commit into from
Nov 8, 2024

yezizp2012
Member

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

// Otherwise our persisted state is dirty.
let mut table_ids = table_fragments.internal_table_ids();
table_ids.push(table_id);
mgr.catalog_manager.assert_tables_deleted(table_ids).await;
Member Author

cancel_create_materialized_view_procedure returns an error in only two situations: (1) writing to the metastore fails, or (2) the job has already been created successfully. In the second case, this assertion causes a meta panic. Because the cancel command has already stopped the actors on the CNs, we return an error here directly and let recovery rebuild them.
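
A minimal sketch of the idea, not the actual RisingWave code: the two failure cases are modeled as a hypothetical `CancelError` enum, and the hypothetical `apply_cancel_command` propagates the error to its caller (which would then trigger recovery) instead of asserting that the tables are gone. All names and signatures below are assumptions made for illustration.

```rust
/// Hypothetical error type covering the two failure cases described above.
#[derive(Debug, PartialEq)]
enum CancelError {
    MetastoreWriteFailed,
    JobAlreadyCreated,
}

/// Hypothetical stand-in for the cancel procedure: it fails only when the
/// job is already created or when the metastore write fails.
fn cancel_creating_job(metastore_ok: bool, already_created: bool) -> Result<(), CancelError> {
    if already_created {
        return Err(CancelError::JobAlreadyCreated);
    }
    if !metastore_ok {
        return Err(CancelError::MetastoreWriteFailed);
    }
    Ok(())
}

/// Instead of asserting that the tables were deleted (which would panic when
/// the job is already created), return the error so the caller can recover.
fn apply_cancel_command(metastore_ok: bool, already_created: bool) -> Result<(), String> {
    cancel_creating_job(metastore_ok, already_created)
        .map_err(|e| format!("cancel failed ({e:?}); trigger recovery"))
}
```

With this shape, a job that has already been created surfaces as an `Err` value rather than as a panic in the meta node.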

// It won't clean the tables on failure,
// since the failure could be recoverable.
// As such it needs to be handled here.
self.barrier_manager_context
Member Author

If the job has already been created, unregistering its tables from Hummock leads to data inconsistency: meta will crash-loop during commit epoch because the state table ID is missing. We therefore unregister only after the catalog has been deleted successfully.
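
The ordering constraint can be sketched as follows. This is an illustrative model only: `Catalog`, `Hummock`, and `cancel_and_clean` are hypothetical in-memory stand-ins, not RisingWave types. The point it shows is that the Hummock unregistration runs only after the catalog deletion has succeeded.

```rust
use std::collections::HashSet;

/// Hypothetical catalog: all known tables plus the subset whose jobs
/// have already finished creation.
struct Catalog {
    tables: HashSet<u32>,
    created: HashSet<u32>,
}

/// Hypothetical storage-side registry of state table IDs.
struct Hummock {
    registered: HashSet<u32>,
}

fn cancel_and_clean(
    catalog: &mut Catalog,
    hummock: &mut Hummock,
    table_id: u32,
) -> Result<(), String> {
    // Step 1: delete from the catalog. This fails if the job is already
    // created, and in that case nothing else may be touched.
    if catalog.created.contains(&table_id) {
        return Err(format!("table {table_id} is already created"));
    }
    catalog.tables.remove(&table_id);

    // Step 2: only after the catalog deletion succeeded, unregister the
    // state table from storage.
    hummock.registered.remove(&table_id);
    Ok(())
}
```

If step 1 fails, the table stays registered, so a later epoch commit still finds its state table ID.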

Member

Is this change needed in main/2.1?

Member Author

No, the behavior has already changed for the SQL backend in main/2.1.

@graphite-app graphite-app bot requested a review from lmatz November 7, 2024 09:43
graphite-app bot commented Nov 7, 2024

Graphite Automations

"release branch request review" took an action on this PR • (11/07/24)

1 reviewer was added to this PR based on xxchan's automation.

@yezizp2012 yezizp2012 added this pull request to the merge queue Nov 8, 2024
Merged via the queue into release-2.0 with commit a8aa71c Nov 8, 2024
27 of 28 checks passed
@yezizp2012 yezizp2012 deleted the fix/cancel-panic-inconsistent branch November 8, 2024 04:27
4 participants